New types of data and new data science technologies enable new research. These new technologies are technologies such as the ability to combine existing data or the ability to generate synthetic data from existing knowledge. This week casus is based on such research. Data is generated by Synthea's COVID-19 module. The data was constructed using three peer-reviewed publications published in the early stages of the global pandemic, when less was known, along with emerging resources, data, publications, and clinical knowledge. The simulation outputs synthetic Electronic Health Records (EHR), including the daily consumption of Personal Protective Equipment (PPE) and other medical devices and supplies. The Data is stored in separate tables to avoid redundancy, with as a concequence that tables need to be combined and reorganized in dataframes for analysing purpose.
Keywords: merge data, subset data, clean data, generate data
You will learn about combining data with pandas and numpy and you will learn to visualize with bokeh. Concretely, you will preprocess the partly Synthetic Covid data in an appropiate format in order to conduct statistical and visual analysis. Learning objectives
Tutorials about combining data: https://github.com/fenna/BFVM22PROG1/blob/main/tutorials/tutorial_combine_data.ipynb
study case combining data:https://github.com/fenna/BFVM22PROG1/blob/main/study_cases/adults_who_binge_drank_in_hot_towns.ipynb
Please add the topics you want to learn about here: https://padlet.com/ffeenstra1/kzh2chaqleq3iovu
Your job is to visualize the lab values taken for COVID-19 patients of survived versus not survived patients.
The assignment consists of 6 parts:
Part 1 and 4 are mandatory, part 5 is optional (bonus) Mind you that you cannot copy code without referencing the code. If you copy code you need to be able to explain your code verbally and you will not get the full score.
The data is generated by Synthea's COVID-19 module. The data was constructed using three peer-reviewed publications published in the early stages of the global pandemic, when less was known, along with emerging resources, data, publications, and clinical knowledge. The simulation outputs synthetic Electronic Health Records (EHR), including the daily consumption of Personal Protective Equipment (PPE) and other medical devices and supplies. For this assignment the conditions, patients, observations, careplans and encounters table will be used. The Data is stored in separate tables to avoid redundancy, with as a concequence that tables need to be combined and reorganized in dataframes for analysing purpose.
Source: Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
Please download the data
Patients are considered Covid patients if they are identified with CODE 840539006
Patients that had covid and where tested negative after isolation have tested code 94531-1, SARS-CoV-2 RNA Pnl Resp NAA+probe (covid-sars test) + a value of Not detected (qualifier value). These patients are considered to be survived covid patients.
Patients that did not survived Covid have a DEATHDATE which is not null.
Patients are monitored for blood and heart conditions once they are admitted in Hospital or under treatment. The lab values of interest are as follow:
48065-7 Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma26881-3 Interleukin 6 [Mass/volume] in Serum or Plasma2276-4 Ferritin [Mass/volume] in Serum or Plasma89579-7 Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method731-0 Lymphocytes [#/volume] in Blood by Automated count14804-9 Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reactionInstructions: Load the data of the following files. Preferably we read the data not with a hard coded data path but using a config file. See https://fennaf.gitbook.io/bfvm22prog1/data-processing/configuration-files/yaml
Get yourself familiar with the data. Create some meaningful overviews. Answer the following questions
import yaml
import pandas as pd
import numpy as np
import datetime
with open('config.yaml') as stream:
config = yaml.safe_load(stream)
careplans = pd.read_csv(config['careplans'])
conditions = pd.read_csv(config['conditions'])
encounters = pd.read_csv(config['encounters'])
observations = pd.read_csv(config['observations'])
patients = pd.read_csv(config['patients'])
dfs = [careplans, conditions, encounters, observations, patients]
features = []
names = ['careplans', 'conditions', 'encounters', 'observations', 'patients']
i = 0
for df in dfs:
print(names[i])
i += 1
print(df.columns)
print('='*20)
careplans
Index(['Id', 'START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION',
'REASONCODE', 'REASONDESCRIPTION'],
dtype='object')
====================
conditions
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')
====================
encounters
Index(['Id', 'START', 'STOP', 'PATIENT', 'ORGANIZATION', 'PROVIDER', 'PAYER',
'ENCOUNTERCLASS', 'CODE', 'DESCRIPTION', 'BASE_ENCOUNTER_COST',
'TOTAL_CLAIM_COST', 'PAYER_COVERAGE', 'REASONCODE',
'REASONDESCRIPTION'],
dtype='object')
====================
observations
Index(['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'VALUE', 'UNITS',
'TYPE'],
dtype='object')
====================
patients
Index(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE', 'ETHNICITY',
'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY', 'ZIP',
'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
dtype='object')
====================
num_pat = len(patients['Id'].unique())
num_cov = len(conditions[conditions['CODE'] == 840539006])
num_admitted = len(encounters[encounters[
'DESCRIPTION']=='Hospital admission for isolation (procedure)'])
num_died = len(patients[patients.DEATHDATE.notna()])
The following function needs to be called. You can use this as a test. There are however more meaningful overviews you can create.
def part1(num_pat, num_cov, num_admitted, num_died):
print(f'There are {num_pat} patients in total')
print(f'There are {num_cov} covid patients')
print(f'There are {num_admitted} admitted patients')
print(f'{num_died} patients died')
part1(num_pat, num_cov, num_admitted, num_died)
There are 12352 patients in total There are 8820 covid patients There are 1867 admitted patients 2352 patients died
In this part we are going to combine data to create a dataframe with values of interest for the lab values analysis.
We would like a dataframe containing the following information per record (only Covid patients!!!)
PATIENT - the ID of the covid patientdays - the number of days the patient is under observationCODE-Y - the code of the observation VALUE - the lab value of the observationwhere only the following observation codes needs to be selected:
48065-7 Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma26881-3 Interleukin 6 [Mass/volume] in Serum or Plasma2276-4 Ferritin [Mass/volume] in Serum or Plasma89579-7 Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method731-0 Lymphocytes [#/volume] in Blood by Automated count14804-9 Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reactionThe days information is not primarely available and needs to be calculated by substracting observation DATE - START.
An example of such a dataframe is given below:
#Possible approach:
#Select all the patients with covid from the conditions table
conditions[conditions['CODE'] == 840539006]
#Combine conditions table (only covid patients) with the patient table into a covid_patient table
covid_patients = pd.merge(left=conditions[conditions['CODE'] == 840539006],
right=patients, how='left',left_on='PATIENT',
right_on='Id')
#select the only the relevant lab observations from the observations table into a lab_obs table
obs_of_intrest = ['48065-7', '26881-3', '2276-4', '89579-7', '731-0', '14804-9']
lab_obs = observations[observations['CODE'].isin(obs_of_intrest)]
#merge the covid_patient table with the lab_obs table into a covid_patients_obs table
covid_patients_obs = pd.merge(covid_patients,lab_obs,on='PATIENT')
covid_patients_obs.columns
#clean the covid_patients_obs table (rename columns, select only relevant columns, sort, typecast, add days column)
covid_patients_obs = covid_patients_obs[['START', 'PATIENT','DATE','CODE_y', 'VALUE', 'UNITS']]
covid_patients_obs["DATE"] = pd.to_datetime(covid_patients_obs["DATE"])
covid_patients_obs["START"] = pd.to_datetime(covid_patients_obs["START"])
covid_patients_obs['VALUE'] = covid_patients_obs['VALUE'].astype(float)
covid_patients_obs['DAYS'] = covid_patients_obs["DATE"] - covid_patients_obs["START"]
covid_patients_obs = covid_patients_obs.rename(columns={'CODE_y':'CODE-Y'})
covid_patients_obs = covid_patients_obs.drop(['START', 'DATE'], axis=1)
covid_patients_obs = covid_patients_obs.sort_values(['PATIENT','DAYS'])
covid_patients_obs = covid_patients_obs[['PATIENT', 'DAYS', 'CODE-Y',
'VALUE','UNITS']]
covid_patients_obs.set_index(keys='PATIENT',inplace=True)
covid_patients_obs.head()
| DAYS | CODE-Y | VALUE | UNITS | |
|---|---|---|---|---|
| PATIENT | ||||
| 00079a57-24a8-430f-b4f8-a1cf34f90060 | 0 days | 731-0 | 1.1 | 10*3/uL |
| 00079a57-24a8-430f-b4f8-a1cf34f90060 | 0 days | 48065-7 | 0.4 | ug/mL |
| 00079a57-24a8-430f-b4f8-a1cf34f90060 | 0 days | 2276-4 | 489.5 | ug/L |
| 00079a57-24a8-430f-b4f8-a1cf34f90060 | 0 days | 89579-7 | 3.4 | pg/mL |
| 00079a57-24a8-430f-b4f8-a1cf34f90060 | 0 days | 14804-9 | 241.8 | U/L |
Now we have the required data we would like to split the data into survived and not survived. First we fetch all the ids of the survived and deceased patients. We can use these ids to select the records of the survived patients and the patients that did not survived
Your job is to split the data into survived and not survived records. There are multiple ways to do this. One way is the .isin() method
#the following code is given, RUN THIS CELL
#get survived and deceased ids
completed_isolation_patients = careplans[(careplans.CODE == 736376001) & (careplans.STOP.notna()) \
& (careplans.REASONCODE == 840539006)].PATIENT
negative_covid_patient_ids = observations[(observations.CODE == '94531-1') \
& (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)
deceased_ids = patients[patients.DEATHDATE.notna()].Id
covid_patients_obs['DECEASED'] = covid_patients_obs.index.isin(deceased_ids)
survived = len(covid_patients_obs[covid_patients_obs['DECEASED'] == False])
deceased = len(covid_patients_obs[covid_patients_obs['DECEASED'] == True])
def test3(survived, died):
print(f'patients records survived: {survived}, patients records deceased {died}')
#call the test3
test3(survived, deceased)
patients records survived: 57125, patients records deceased 16793
Create plots with the lab data, for each code one plot. Separate the survivors and the deceased by color. An example of such a plot is given below. You can create 6 plots in one grid (for each code one plot) or use a widget (for instance a drop down menu widget) to select a lab CODE. Plot on the x-axis the days, on the y-axis the VALUE. Use proper labels, titles and legends.

from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
from bokeh.io import output_notebook
from bokeh.plotting import ColumnDataSource
from bokeh.models import DatetimeTickFormatter
output_notebook()
# for ease of use
df = covid_patients_obs
df['DAYS'][df['DECEASED']==False]
descriptions = {'48065-7' : 'Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma',
'26881-3' : 'Interleukin 6 [Mass/volume] in Serum or Plasma',
'2276-4' : 'Ferritin [Mass/volume] in Serum or Plasma',
'89579-7' : 'Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method',
'731-0' : 'Lymphocytes [#/volume] in Blood by Automated count',
'14804-9' : 'Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction'
}
units = {'48065-7' : 'ug/mL',
'26881-3' : 'pg/mL',
'2276-4' : 'ug/L',
'89579-7' : 'pg/mL',
'731-0' : '10*3/uL',
'14804-9' : 'U/L'
}
from bokeh.models import Panel, Tabs
output_notebook()
tabs = []
for obs in obs_of_intrest:
p = figure(title=descriptions[obs],
x_axis_type='datetime', x_axis_label='Time in days',
y_axis_label=units[obs])
p.xaxis.formatter=DatetimeTickFormatter(days=["%d"])
p.circle(x=df['DAYS'][(df['CODE-Y']==obs) &
(df['DECEASED']==True)],
y=df['VALUE'][(df['CODE-Y']==obs) &
(df['DECEASED']==True)],
color='red', alpha=0.3, size=10, legend_label='Deceased')
p.circle(x=df['DAYS'][(df['CODE-Y']==obs) &
(df['DECEASED']==False)],
y=df['VALUE'][(df['CODE-Y']==obs) &
(df['DECEASED']==False)],
color='green', alpha=0.3, size=10, legend_label='Survived')
p.legend.title = 'Legend'
p.legend.location = 'top_left'
tabs.append(Panel(child=p, title=descriptions[obs].split()[0]))
show(Tabs(tabs=tabs))
This is a bonus part. Can you plot the patients location on a map? See also https://docs.bokeh.org/en/latest/docs/user_guide/geo.html
You can use either package folium or geopandas. You need the Latitude and Longitude information from the patient tabel
import folium
patients.head()
| Id | BIRTHDATE | DEATHDATE | SSN | DRIVERS | PASSPORT | PREFIX | FIRST | LAST | SUFFIX | ... | BIRTHPLACE | ADDRESS | CITY | STATE | COUNTY | ZIP | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | f0f3bc8d-ef38-49ce-a2bd-dfdda982b271 | 2017-08-24 | NaN | 999-68-6630 | NaN | NaN | NaN | Jacinto644 | Kris249 | NaN | ... | Beverly Massachusetts US | 888 Hickle Ferry Suite 38 | Springfield | Massachusetts | Hampden County | 1106.0 | 42.151961 | -72.598959 | 8446.49 | 1499.08 |
| 1 | 067318a4-db8f-447f-8b6e-f2f61e9baaa5 | 2016-08-01 | NaN | 999-15-5895 | NaN | NaN | NaN | Alva958 | Krajcik437 | NaN | ... | Boston Massachusetts US | 1048 Skiles Trailer | Walpole | Massachusetts | Norfolk County | 2081.0 | 42.177370 | -71.281353 | 89893.40 | 1845.72 |
| 2 | ae9efba3-ddc4-43f9-a781-f72019388548 | 1992-06-30 | NaN | 999-27-3385 | S99971451 | X53218815X | Mr. | Jayson808 | Fadel536 | NaN | ... | Springfield Massachusetts US | 1056 Harris Lane Suite 70 | Chicopee | Massachusetts | Hampden County | 1020.0 | 42.181642 | -72.608842 | 577445.86 | 3528.84 |
| 3 | 199c586f-af16-4091-9998-ee4cfc02ee7a | 2004-01-09 | NaN | 999-73-2461 | S99956432 | NaN | NaN | Jimmie93 | Harris789 | NaN | ... | Worcester Massachusetts US | 201 Mitchell Lodge Unit 67 | Pembroke | Massachusetts | Plymouth County | NaN | 42.075292 | -70.757035 | 336701.72 | 2705.64 |
| 4 | 353016ea-a0ff-4154-85bb-1cf8b6cedf20 | 1996-11-15 | NaN | 999-60-7372 | S99917327 | X58903159X | Mr. | Gregorio366 | Auer97 | NaN | ... | Patras Achaea GR | 1050 Lindgren Extension Apt 38 | Boston | Massachusetts | Suffolk County | 2135.0 | 42.352434 | -71.028610 | 484076.34 | 3043.04 |
5 rows × 25 columns
mapobj = folium.Map(location=[42.08610119303171, -71.60329922178451], zoom_start=8)
borderstyle = {
'color':'blue',
'fill':False,
'weight':3
}
url = 'https://raw.githubusercontent.com/glynnbird/usstatesgeojson/master/massachusetts.geojson'
folium.GeoJson(data=url, name='borders', style_function=lambda x: borderstyle).add_to(mapobj)
health_cost = folium.FeatureGroup('healthcosts').add_to(mapobj)
for i in range(0,len(patients)):
#get values
lat = patients.iloc[i]['LAT']
lon = patients.iloc[i]['LON']
costs = patients.iloc[i]['HEALTHCARE_EXPENSES']
radius = costs/5000
died = patients.iloc[i]['DEATHDATE']
color = 'lightgreen' if pd.isna(died) else 'pink'
#create circles
folium.Circle(
popup= f'costs: {costs}, deathdate: {died}',
location = [lat, lon],
radius = radius,
fill = True,
fill_opacitiy = 0.2,
color = color
).add_to(health_cost)
folium.LayerControl().add_to(mapobj)
mapobj